Red Wine Chemistry Analysis by Tom McKenzie

Abstract

The aim of this project is to analyse the “wineQualityReds” dataset. This dataset contains 1599 observations of 13 variables that describe chemical aspects of a given red wine, including a qualitative rating of the wine taste/quality.

Load data

The file was downloaded as a .csv and stored in the project directory.

It was then read into a data frame object using the read.csv() function.
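A minimal sketch of this loading step. The file name `wineQualityReds.csv` is an assumption based on the dataset name, and `load_wine()` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: read the wine .csv into a data frame.
# The default file name is assumed from the dataset name.
load_wine <- function(path = "wineQualityReds.csv") {
  if (!file.exists(path)) stop("File not found: ", path)
  read.csv(path)
}

# Only run the load and summary if the file is actually present.
csv_path <- "wineQualityReds.csv"
if (file.exists(csv_path)) {
  wine <- load_wine(csv_path)
  print(dim(wine))       # observations x variables
  print(summary(wine))   # per-variable summary statistics
}
```

The guard keeps the sketch runnable from any directory; in the project directory it would print a summary table like the one shown here.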

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      total.acidity   
##  Min.   : 8.40   Min.   :3.000   Min.   : 5.120  
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.: 7.680  
##  Median :10.20   Median :6.000   Median : 8.445  
##  Mean   :10.42   Mean   :5.636   Mean   : 8.847  
##  3rd Qu.:11.10   3rd Qu.:6.000   3rd Qu.: 9.740  
##  Max.   :14.90   Max.   :8.000   Max.   :16.285

The dataset contains several distinct variables corresponding to chemical components. Some appear related (e.g. ‘free.sulfur.dioxide’ and ‘total.sulfur.dioxide’), while others at first glance appear distinct. There is a variable called ‘X’ which appears to be just a unique ‘id’ or index column. I’ll remove this from the dataset as it is unnecessary.
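Dropping the index column could look like the sketch below. The data frame here is a toy stand-in with made-up values, not the real dataset.

```r
# Toy stand-in for the wine data frame (values are made up for illustration).
wine <- data.frame(
  X       = 1:3,
  alcohol = c(9.4, 9.8, 10.5),
  quality = c(5L, 5L, 6L)
)

# Confirm that 'X' is just a running row index before dropping it.
stopifnot(identical(wine$X, seq_len(nrow(wine))))

wine$X <- NULL   # remove the index column
names(wine)      # remaining column names
```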

Univariate Plots Section

Check out the histogram of “quality” ratings, as this is perhaps the variable of most interest in this investigation.

The quality ratings appear to be integer values only, with the vast majority given ratings of 5 - 7. Note, a binwidth of 1 is used as the values can only be integers.
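As an illustration of the bin-width choice, here is a base R sketch (the original plot may well have been drawn with a plotting package instead). The ratings are made up; placing the breaks at half-integers gives bins of width 1 centred on each integer rating.

```r
# Made-up quality ratings, for illustration only.
quality <- c(5, 5, 6, 6, 6, 7, 5, 6, 4, 7, 8, 3, 5, 6)

# Breaks at half-integers: each bin has width 1 and holds exactly one rating.
breaks <- seq(min(quality) - 0.5, max(quality) + 0.5, by = 1)
h <- hist(quality, breaks = breaks,
          main = "Wine quality", xlab = "quality")

h$counts   # counts per rating, lowest rating first
```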

Will now check out the other variables individually to get an idea (visually) of how they are distributed.

Variable: ‘fixed.acidity’; Plot: histogram.

After removing the outliers and converting the x-axis to a log scale, a unimodal distribution is obtained with a peak around 7.
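The text does not say how the outliers were removed. The sketch below assumes one common rule (dropping values above the 99th percentile) and uses simulated right-skewed values in place of the real ‘fixed.acidity’ column.

```r
set.seed(1)
# Simulated right-skewed values standing in for fixed.acidity (not real data).
fixed_acidity <- rlnorm(500, meanlog = log(8), sdlog = 0.2)

# Assumed outlier rule: keep values at or below the 99th percentile.
cutoff  <- quantile(fixed_acidity, 0.99)
trimmed <- fixed_acidity[fixed_acidity <= cutoff]

# Histogram of log10 values, with tick labels shown in the original units.
h <- hist(log10(trimmed), breaks = 30, xaxt = "n",
          main = "fixed.acidity (log10 scale)", xlab = "fixed.acidity")
ticks <- pretty(trimmed)
ticks <- ticks[ticks > 0]   # log10 is only defined for positive values
axis(1, at = log10(ticks), labels = ticks)
```

The same pattern applies to the other variables that are trimmed and log-scaled in this section.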

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   7.100   7.900   8.281   9.200  13.800

The mean of this distribution is 8.281% and the median is 7.900%.

Variable: ‘volatile.acidity’; Plot: histogram.

After removing outliers, the distribution is approximately symmetrical even without log scaling. There may be two peaks present, but the count values are relatively low so it could just be an artefact of the small dataset.

Summary statistics for the volatile.acidity variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.120   0.390   0.520   0.519   0.630   0.980

The mean is 0.519%, and the median is 0.520%.

Variable: ‘total.acidity’; Plot: histogram.

The distribution of the calculated variable ‘total.acidity’ is approximately normal after a log transformation. This is not surprising as the major component of total.acidity is fixed.acidity, which required the same transformation.

Variable: ‘sulphates’; Plot: histogram.

After a log transformation this is now a very nice normal distribution, with a peak around 0.6%.

Summary statistics for the sulphates variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6487  0.7200  1.2200

The mean for sulphates is 0.6487%, with a median value of 0.6200%.

Variable: ‘alcohol’; Plot: histogram.

We see that even after log transformation the data is still strongly positively skewed. May be better transformed using a different function.

Summary statistics for the alcohol variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.4     9.5    10.1    10.4    11.1    13.6

The mean alcohol percentage is 10.4%, with a median value of 10.1%.

Variable: ‘pH’; Plot: histogram.

Even without log transformation this distribution appears roughly symmetrical, with a peak around a pH value of 3.3.

Summary statistics for the pH variable:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.210   3.310   3.309   3.400   3.740

The mean pH was 3.309, and the median was 3.310.

Univariate Analysis

What is the structure of your dataset?

The data is already in a relatively ‘tidy’ format. The only ‘cleaning’ that was applied was removal of the unnecessary ‘X’ column (which was effectively an index).

However, the variable units are unknown. I have assumed that the chemical components like ‘acidity’ and ‘chlorides’ might be reported as a percentage, much like the ‘alcohol’ variable (which, based off my prior domain knowledge, is almost certainly in vol%). The ‘density’ and ‘pH’ variables are likely in ‘g/cm3’ and unitless, respectively, while the ‘quality’ rating is a basic 0 - 10 scale, with 0 being the “worst” and 10 the “best”. These units will be used for the rest of the analysis.

What is/are the main feature(s) of interest in your dataset?

I am most interested in seeing how the physico-chemical elements affect the “quality” rating, as given by a panel of 3 wine experts. I am also rather surprised at the acidity of wine, with 75% of the wines having a pH of 3.4 or lower.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think significant chloride and sulphate contents may negatively impact the rating given to a particular wine.

Did you create any new variables from existing variables in the dataset?

A variable called ‘total.acidity’, the sum of ‘fixed.acidity’ and ‘volatile.acidity’, was created as a simplification for the effect of acidity.
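The derived column is a simple element-wise sum. A short sketch with made-up values:

```r
# Toy rows with made-up values, for illustration only.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.28)
)

# New variable: sum of the fixed and volatile acidity columns.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity
wine
```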

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The ‘alcohol’ percentage was skewed quite strongly. Even after applying a log10 scaling transformation the distribution was still long-tailed. Otherwise, the variables investigated were monomodal and approximately normally-distributed, perhaps as expected for these kinds of data and given the uniformity of the wine industry (e.g. no extreme outliers, as consumers would likely not appreciate compositions that differed markedly from what they were used to!).

Bivariate Plots Section

First, I’d like to create a scatterplot matrix on a sample of the data. I’ll save the generated plot as an image to refer to later.

It appears that the strongest bivariate correlation with ‘quality’ is ‘alcohol’, with a correlation coefficient of 0.476. Also, ‘quality’ is negatively correlated with ‘volatile.acidity’, with a coefficient of -0.391.
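Individual coefficients like these can be checked directly with `cor()`, which computes the Pearson coefficient by default (assumed here to be the coefficient shown in the matrix). The data below is simulated purely to make the sketch runnable; it does not reproduce the values quoted above.

```r
set.seed(42)
n <- 200
# Simulated stand-ins for the two columns (not real data).
alcohol <- rnorm(n, mean = 10.4, sd = 1)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, sd = 0.7))

r <- cor(alcohol, quality)   # Pearson correlation by default
r

# 95% confidence interval for the coefficient.
cor.test(alcohol, quality)$conf.int
```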

These both seem like reasonable observations. Some other logical correlations are the ‘acidity’/‘acid’ variables with ‘pH’ (e.g. ‘fixed.acidity’ and ‘pH’ (-0.683)), and the ‘alcohol’ with ‘density’ (-0.496), given that acids lower the pH and alcohol is less dense than water (the other primary constituent of wine).

Overall, there are not too many surprising correlations in the scatterplot matrix. However, I’d like to investigate the relationships with quality further.

Given the integer nature of the quality rating and the overplotting on the scatterplot, this might actually be better as a structured boxplot.

It is quite clear from this boxplot that the higher quality wines have higher alcohol contents, with the median alcohol content for ratings of 7 and 8 both significantly higher than the other median values.
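A sketch of this kind of grouped boxplot, together with the per-rating medians it is summarising. The values are made up for illustration.

```r
# Toy data with made-up values, for illustration only.
wine <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 9.5, 10.0, 10.5, 10.2, 11.2, 11.8, 11.5)
)

# One box per integer quality rating.
boxplot(alcohol ~ factor(quality), data = wine,
        xlab = "quality", ylab = "alcohol")

# Median alcohol content within each rating.
med <- tapply(wine$alcohol, wine$quality, median)
med
```

Treating ‘quality’ as a factor avoids the overplotting problem, because each rating gets its own box instead of a column of stacked points.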

See what effect chlorides and sulphates have on rating.

First, let’s look at a box plot of the chloride content for different rating values.

There are many outliers for the middle rating values; however, it appears that lower chloride content is associated with higher ratings.

Next, let’s plot the same thing for sulphate content.

This is somewhat unexpected. Again, there is large variability in the middle rating values; however, it appears that higher sulphate contents are associated with better quality ratings.

Now look at pH.

No clear relationship observed. There are lots of outliers, but most points are clustered around the median pH and quality values.

I want to see how the citric acid levels affect the pH.

This clearly indicates a negative linear relationship between citric acid and pH.

I want to see the effect on pH of all the acidity-based variables (including ‘total.acidity’). Will arrange scatterplots in a grid to see this.

This now clearly shows that the volatile.acidity does not affect the pH of the wine. This might also explain why, given the moderate negative correlation between volatile.acidity and quality observed in the scatterplot matrix, no relationship was observed in the scatterplot of pH and quality.

Can look more closely at the negative correlation between the ‘volatile.acidity’ and the wine quality with a simple boxplot.

It certainly appears that lowering the volatile acidity correlates with an improvement in the quality.

Does ‘residual.sugar’ correlate with ‘alcohol’ content?

I would initially assume that given fermentation converts sugars into alcohol, higher alcohol contents would lead to lower residual sugar contents. Although a few outliers with high residual sugar levels and low alcohol contents may indicate this effect slightly, perhaps it depends also on other factors like the amount of initially-added sugar, for which we don’t have the data.

Do the judges like sweet wines?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.703   6.000   8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Maybe! Although the median values are both 6.0, the mean for the ‘sugary wines’ is slightly higher than that of the entire selection of wines (5.703 vs 5.636).
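The comparison above is a subset summary set against the summary for the full dataset. The cutoff used for ‘sugary wines’ is not stated, so the sketch below assumes the upper quartile of ‘residual.sugar’, and the data is made up.

```r
# Toy data with made-up values, for illustration only.
wine <- data.frame(
  residual.sugar = c(1.9, 2.2, 2.6, 5.5, 6.1, 1.8, 2.0, 7.3),
  quality        = c(5, 6, 5, 6, 7, 5, 6, 6)
)

# Assumed cutoff: wines above the 75th percentile of residual sugar.
threshold <- quantile(wine$residual.sugar, 0.75)
sugary    <- subset(wine, residual.sugar > threshold)

summary(sugary$quality)   # quality among the 'sugary' wines
summary(wine$quality)     # quality across all wines
```

Whether a small gap between the two means is meaningful would need a formal test, which is not attempted here.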

Check to see if ‘free.sulfur.dioxide’ correlates with ‘total.sulfur.dioxide’.

This does appear to be a positive correlation, but it becomes quite broad at higher values. Will take the sqrt() of both axes.

This transformed plot appears more uniformly linear.
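A side-by-side sketch of the raw and square-root-transformed scatterplots, using simulated values in place of the two sulfur dioxide columns.

```r
set.seed(7)
# Simulated stand-ins (not real data); total is built to be >= free.
free_so2  <- rgamma(300, shape = 2, scale = 8)
total_so2 <- free_so2 + rgamma(300, shape = 2, scale = 15)

op <- par(mfrow = c(1, 2))   # two panels side by side
plot(free_so2, total_so2, main = "raw",
     xlab = "free.sulfur.dioxide", ylab = "total.sulfur.dioxide")
plot(sqrt(free_so2), sqrt(total_so2), main = "sqrt of both axes",
     xlab = "sqrt(free.sulfur.dioxide)", ylab = "sqrt(total.sulfur.dioxide)")
par(op)                      # restore the previous plot layout
```

The square root compresses the larger values more than the smaller ones, which is why the fan-shaped spread at the high end tightens.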

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

It was interesting to note the strongest correlations with ‘quality’ were alcohol content and the volatile acidity. Perhaps having a strongly acidic aroma is off-putting, affecting the perceived quality of the wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between residual.sugar and alcohol content was of interest to me. I would’ve assumed that higher alcohol contents resulted in lower residual.sugar values; however, there was only a very weak relationship. There may be other factors that go into determining the levels of residual.sugar that we don’t have the data for.

What was the strongest relationship you found?

The relationships between acidity/acid content and the pH were obviously very strong. However, in terms of the ‘quality’ variable of interest, the strongest correlation was with alcohol content, with higher alcohol content wines receiving higher quality ratings.

Multivariate Plots Section

Will plot alcohol vs quality, with color-coding for the volatile.acidity variable to see if any obvious trend exists. Will scale the alcohol axis to log10 to try to account for its skewed distribution.

It is hard to distinguish any trends based on the color-coding in this plot. The individual boxplots showed the correlation more clearly.

Will try adding the ‘quality’ as a third variable to our sulfur.dioxide plot from earlier.

Again, it is quite difficult to see a trend. The quality appears not to be correlated with sulfur.dioxide content. This seems reasonable given their low correlation in the scatterplot matrix (< 0.3).

Wanted to look at quality with volatile.acidity, and see if a relationship with citric.acid content was also present.

Although there does not appear to be a vertical relationship for citric.acid (i.e. with quality), a horizontal trend is apparent, with lower volatile.acidity values having higher citric.acid contents (darker points). This was unexpected, i.e. higher citric.acid contents correspond to lower volatile.acidity. It may be because citric.acid is non-volatile, and so if there is a higher content of it there is a correspondingly lower content of more volatile acids.

Lastly, I wanted to confirm the correlation between density and alcohol content.

This plot demonstrates the negative correlation between density and alcohol, although the effect of these variables on the quality is hard to discern.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The combined effect of volatile.acidity and alcohol on the quality rating was difficult to clearly illustrate using multivariate analysis.

It was nice to visualize the clear effect of alcohol content on the density of the wine, and also overlay the quality rating for these. Did the higher alcohol content wines receive better quality ratings because of the alcohol, or because of their lower densities?

Were there any interesting or surprising interactions between features?

The use of citric.acid as a third variable in the plot of quality vs volatile.acidity provided an interesting observation in that lower volatile.acidity values generally had higher citric.acid contents. This was contrary to previous expectations.


Final Plots and Summary

Plot One

Description One

Volatile acidity was one of the variables most strongly correlated with the ‘quality’ of the wine, albeit negatively. The box plot shows most clearly that decreasing volatile acidity correlates with increasing wine quality.

Plot Two

Description Two

This plot was of interest to me, as it demonstrates that the volatile acidity does not correlate with pH. This was somewhat unexpected, as one would expect that the volatile acidity may affect the acidity of the wine itself (measured as the pH), i.e. more acid gives lower pH. This does not appear to be the case, at least for the volatile acids. I added an alpha factor to overcome the overplotting, and also added a ‘jitter’ as the pH values seemed to only change by 0.1 increments.
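The alpha and jitter adjustments can be sketched in base R as follows (the original plot may have used a plotting package instead). The data is simulated, and the jitter amount and alpha value are arbitrary choices for illustration.

```r
set.seed(3)
# Simulated stand-ins (not real data); pH is rounded so points stack in columns.
ph <- round(rnorm(300, mean = 3.3, sd = 0.15), 1)
va <- rnorm(300, mean = 0.53, sd = 0.18)

# Small horizontal jitter spreads the stacked pH values apart.
ph_jittered <- jitter(ph, amount = 0.04)

# Semi-transparent points make dense regions appear darker.
plot(ph_jittered, va, pch = 16,
     col = adjustcolor("black", alpha.f = 0.25),
     xlab = "pH", ylab = "volatile.acidity")
```

Keeping the jitter amount below half the spacing between values (0.05 here) ensures that neighbouring columns of points do not blend into each other.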

Plot Three

Description Three

This plot illustrates the decreasing quality of wines with increasing volatile acidity. However, when the citric acid content is added we see an inverse relationship between volatile acidity and citric acidity, which was unexpected.

Reflection

This dataset was not very large, only 1599 observations over 13 variables. Moreover, the variable of interest to me (‘quality’) ended up being quite difficult to analyse effectively due to the small range of possible integer values it could take (all between 3 and 8).

Despite these challenges, some insights into the effects of various chemical components on wine ‘quality’ could be gained. The largest correlations observed were an increase in quality with alcohol content, and a decrease in quality with volatile acidity.

The better outcomes for higher alcohol contents may be due to a combination of related factors, such as fermentation time, or reduced sweetness. However, the latter was investigated by plotting residual.sugar vs alcohol, and displayed at best a very weak correlation. Not surprisingly, the increasing alcohol content correlated with a decrease in density (due to the inherent physical properties of ethanol and water), but I believe that to be an unlikely contributing variable to the increased quality as judged by the wine tasters. High volatile acidities giving poorer quality ratings makes sense to me, as an acidic wine aroma can be off-putting.

It would be interesting to try to build a model to predict quality based on these two primary variables (i.e. alcohol and volatile acidity), although I expect we would need significantly more data points to get a reliable predictor, as the alcohol content cannot go up much more without becoming off-putting. Therefore, this model may be quite complex.
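As a starting point only, such a model could be sketched as an ordinary linear regression on the two variables. This is an assumption about the model form, not something fitted in this analysis: ‘quality’ is an ordinal rating, so a linear fit is a simplification, and the data below is simulated rather than taken from the dataset.

```r
set.seed(123)
n <- 300
# Simulated stand-in data (not real); coefficients chosen arbitrarily.
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
wine$quality <- round(
  5.6 + 0.35 * (wine$alcohol - 10.4) -
    1.2 * (wine$volatile.acidity - 0.53) + rnorm(n, sd = 0.6)
)

# Linear model of quality on the two predictors.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)
coef(fit)                # sign and size of each effect
summary(fit)$r.squared   # share of variance explained
```

On the real data, the R-squared value would give a first indication of how much of the variation in quality these two variables can explain on their own.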